Parallel Text Mining for Large Text Processing

Authors

  • Firat Tekiner
  • Yoshimasa Tsuruoka
  • Jun'ichi Tsujii
  • Sophia Ananiadou
  • John Keane
Abstract

There is an urgent need to develop new text mining solutions using High Performance Computing (HPC) and grid environments to tackle the exponential growth in textual data. Problem sizes increase daily as new text documents are added. The aim of this work is therefore to lay the foundations for mining large text datasets (i.e. full-text articles) in reasonable timeframes. Labelling sequence data, as in part-of-speech (POS) tagging, chunking (shallow parsing) and named entity recognition, is one of the most important tasks in text mining. This work focuses on the state-of-the-art GENIA tagger and STEPP parser. GENIA is a POS tagger specifically tuned for biomedical text, and STEPP is a full parser. Parallel versions of GENIA and STEPP have been developed, and their performance has been compared on a number of different architectures. The focus has been particularly on scalability: scaling to 512 processors has been achieved. Furthermore, a parallel text mining framework has been proposed that enables scaling to 10000 processors for massively parallel text mining applications. Processing times for the given datasets have been reduced dramatically, from over 70 days to hours (towards a reduction of three orders of magnitude). The parallel implementation uses the Message Passing Interface (MPI) to achieve portable code. The resulting parallel applications have been tested on a number of architectures, using the entire collection of Medline abstracts together with 125000 full-text articles.
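The abstract does not say how documents are divided among processors. As a minimal sketch only, assuming a static block decomposition (an assumption, not necessarily the authors' scheme), the per-worker share of a document collection and the ideal running time under perfect linear speedup could be computed as follows. The names `block_partition` and `ideal_hours` and the document IDs are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def block_partition(items: Sequence[T], rank: int, size: int) -> List[T]:
    """Return the contiguous slice of `items` assigned to worker `rank` of `size`.

    Assumed scheme: static block decomposition (not taken from the paper).
    """
    if size <= 0 or not 0 <= rank < size:
        raise ValueError("need 0 <= rank < size")
    base, extra = divmod(len(items), size)
    # The first `extra` ranks take one additional item,
    # so share sizes differ by at most one.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return list(items[start:stop])


def ideal_hours(serial_days: float, workers: int) -> float:
    """Wall-clock hours under perfect linear speedup (an upper bound on the gain)."""
    return serial_days * 24.0 / workers


if __name__ == "__main__":
    docs = [f"doc{i}" for i in range(10)]  # hypothetical document IDs
    for rank in range(4):
        print(rank, block_partition(docs, rank, 4))
    # 70 serial days spread over 512 workers, ignoring all overheads
    print(ideal_hours(70, 512))
```

In an MPI program, `rank` and `size` would typically come from the communicator (for example via `MPI_Comm_rank` and `MPI_Comm_size`), and each process would tag or parse only its own share. For a 70-day serial run on 512 workers the ideal figure is about 3.3 hours; communication, I/O and load imbalance would add to this in practice.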

Similar resources

A Model for Extracting Information from Text Documents Based on Text Mining in the E-Learning Domain

As computer networks become the backbones of science and the economy, enormous quantities of documents become available. Text mining techniques have therefore been used to extract useful information from textual data. Text mining has become an important research area that discovers unknown information, facts or new hypotheses by automatically extracting information from different written documents. T...

The Web as a Parallel Corpus

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structur...

Mining Large-scale Parallel Corpora from Multilingual Patents: An English-Chinese example and its application to SMT

In this paper, we demonstrate how to mine large-scale parallel corpora with multilingual patents, which have not been thoroughly explored before. We show how a large-scale English-Chinese parallel corpus containing over 14 million sentence pairs with only 1-5% wrong can be mined from a large amount of English-Chinese bilingual patents. To our knowledge, this is the largest single parallel corpu...

An Approach for Managing and Organizing Text Documents Using Intelligent Text Analysis

Because stored data occupies a large amount of space in organizations, and retention and information management systems have resulted in gigantic data warehouses, the need to extract an appropriate model is increasingly felt. Text mining is one of the most significant methods for extracting a useful and appropriate model that helps organizations in achieving their goals throug...

Automated Mining Of Names Using Parallel Hindi-English Corpus

Machine transliteration has a number of applications in a variety of natural language processing tasks such as machine translation, information retrieval and question answering. For automated learning of machine transliteration, a large parallel corpus of names in two scripts is required. In this paper we present a simple yet powerful method for automatic mining of Hindi-English names fr...

Publication date: 2013